The Person Response Curve:

Fit  of  Individuals  to  Item  Characteristic 
Curve  Models 


The  development  of  group  ability  tests  more  than  50  years  ago  has  enabled 
the  comparison  of  the  total  test  score  of  an  individual  with  the  scores  of  a 
population norm group, thus allowing for more meaningful interpretation of ability
estimates than can be done with the use of simple number-correct scores.

For example, the statement "On the XYZ aptitude test John scored at the 73rd
percentile of college students" gives more information than the statement, "On
the ABC ability test Mary correctly answered 64 questions out of 90, whereas
Sam correctly answered 33 questions." Both examples have in common the report
of a person's test performance on a specific dimension given in terms of an
overall test score; but this single summary score, while more parsimonious than
a description of a testee's entire response pattern, may not reveal the operation
of other factors on test-taking behavior, such as guessing, anxiety, cultural
bias, or lack of motivation. Thus, total scores on a test do not indicate
whether that test is inappropriate for a certain individual or group of individuals.

The  emergence  of  modern  test  theory,  based  on  the  item  characteristic  curve 
(ICC;  Hambleton  &  Cook,  1977;  Lord  &  Novick,  1968),  brings  with  it  the  promise 
of  better  tests  conveying  more  accurate  information  about  testee  ability  levels. 
This  is  partially  accomplished  by  use  of  ability  estimation  procedures  that  take 
into  account  the  testee's  total  response  pattern  in  estimating  ability  levels 
(Bejar & Weiss, 1979; Kingsbury & Weiss, 1979a). These scoring methods can provide
individualized error bands around the testee's ability level estimates,
which indicate the precision of those ability estimates (e.g., Kingsbury & Weiss,
1979b). Thus, in addition to providing methods designed to permit more adequate
test construction by the use of test information curves (Hambleton & Cook, 1977),
ICC theory permits utilizing more of a testee's response pattern in order to provide
individualized estimates of precision for ability estimates. In addition,
ICC theory allows for the development of powerful methods of adaptive testing
for the solution of many practical measurement problems (e.g., Brown & Weiss,
1977;  Kingsbury  &  Weiss,  1979b;  McBride  &  Weiss,  1976;  Vale  &  Weiss,  1977;  Weiss, 
1973,  1975). 

In  contrast  to  classical  test  theory,  ICC  theory  makes  strong  assumptions 
in  order  to  achieve  its  objectives.  The  major  operational  forms  of  ICC  theory 
assume  1)  local  independence,  2)  unidimensionality,  and  3)  a  specified  shape 
for the item characteristic curve. Although local independence cannot be directly
demonstrated, data supporting the unidimensionality assumption in a
variety of settings (e.g., Bejar, Weiss, & Kingsbury, 1977; Church, Pine, &
Weiss, 1978; Martin, Pine, & Weiss, 1978; McBride & Weiss, 1974; Reckase, 1978)
lend indirect support to the assumption of local independence. Lord (1968)
has  presented  data  showing  that  the  assumption  of  a  normal  ogive  ICC  is  tenable 
and,  given  the  minor  differences  between  a  logistic  ogive  and  a  normal  ogive, 
has  indirectly  supported  the  use  of  the  logistic  item  response  function  in  ICC 
theory. 


There  has  been  very  little  research,  however,  to  demonstrate  that 
individuals  behave  in  accordance  with  the  ICC  model,  although  a  growing  concern 
has  been  exhibited  in  the  testing  literature  for  the  development  of  methods  to 
extract more information from test response data than simply a total score. Use
of ICC models with individuals must rest on a demonstration that the test responses
of individuals are in accordance with the testing model hypothesized.
If this can be demonstrated for most individuals on a number of tests, ICC models
can be used with confidence to their full power. On the other hand, if a majority
of individuals respond in ways contrary to ICC theory, the utility of the
theory for individual measurement can be seriously questioned.

A major advantage of the assumptions of ICC theory for individual measurement
is that the question of individuals' fit or non-fit to the model can be investigated
on an individual basis. The practical implications of identifying
non-fitting  persons  were  realized  by  Educational  Testing  Service  in  their  study 
of  methods  to  identify  response  patterns  of  the  type  of  student  who  "may  be  so 
atypical  and  unlike  other  students  that  his  [or  her]  aptitude  test  score  fails  to 
be  a  completely  appropriate  measure  of  his  [or  her]  relative  ability"  (Levine  & 
Rubin, 1976). Examples of such students are low-ability examinees who copy answers
to several difficult items from a much more able neighbor and very high-ability
examinees fluent in another language but not yet fluent in English, who
misunderstand  the  wording  of  several  relatively  easy  questions.  Levine  and 
Rubin  recommended  the  development  of  indices  to  identify  such  test  item  response 
patterns  as  a  "rich  and  fertile  area  for  future  research." 

The appropriateness of a certain test or certain items for specific individuals
has also been an important concern for test developers working with the
one-parameter logistic ICC (Rasch) model. Wright and his associates (1977; Mead,
1979; Wright & Stone, 1979; Wainer & Wright, in prep.) have proposed identification
of such factors as guessing, carelessness, and bias, using the Rasch model.
According  to  Lumsden  (1977),  a  bright  but  careless  student  may  have  the  same 
overall  ability  score  as  a  careful  and  consistent  average  student,  but  there  are 
differential  instructional  implications  for  teaching  these  two  types  of  students 
or differential counseling implications if the two students are seeking vocational
counseling.

Thus, the question of fit or non-fit of individuals to ICC testing models
is of both practical and theoretical importance. Fit of individuals must be
demonstrated in order to realize the full potential of the model for practical
use. At the same time, the development of reliable and valid methods for quantifying
and identifying aberrant response patterns would provide a potentially
useful  source  of  additional  information  on  test-taking  behavior  of  individuals. 

Related  Research 

The question of fit of individuals to the ICC models can be conceptualized
as investigating the variability of a single individual in a single testing situation.
Wright (1977), in suggesting that to postulate and to study such a phenomenon
would be to "wreak havoc with the logic and practice of measurement,"
exemplifies an attitude which may, in part, account for the meager literature on
the topic. It is more likely, however, that the development of sufficiently refined
measurement techniques to handle such a difficult problem has not occurred
until very recently. The development of computerized testing together with the
development of latent trait test theory was necessary to bring about the possibility
of measuring individual variability with a single test.

Most of the existing research consists of tentative theoretical approaches
with  closing  exhortations  for  further  study.  The  approaches  to  this  problem 
differ  widely  in  theoretical  orientation  and  in  terminology  used.  Mosier  (1942) 
first referred to individual variability in mental test theory from a psychophysical
orientation; and Levine and Rubin (1976) referred to aberrance indices
from the view of signal detection theory. Lumsden (1977) used the Thurstonian
approach of categorical judgment to propose the idea of person reliability.
Weiss (1973) used data from adaptive testing to develop consistency scores, and
Vale and Weiss (1975) further developed the earlier idea of consistency scores
into an empirical study of subject characteristic curves. Wright (1977) used
the one-parameter logistic model to propose the idea of item residuals and to
refute the notion of what he called person sensitivity in testing. Clearly,
the idea is still new and hazily formulated on a theoretical level, with very little
evidence from empirical studies.

Mosier's psychophysical approach. The first reference in the testing literature
to an individual's variability within a single ability testing situation
was  in  a  two-part  study  by  Mosier  (1940,  1942).  The  emphasis  in  this  study  was 
on  the  fundamental  relationships  between  the  field  of  mental  test  theory  and 
the methods of measuring psychophysical processes. This comparison included relating
the constant method of psychophysical measurement with scoring by the number-correct
method in mental testing. Mosier asserted that a composite score
is  an  imperfect  representation  of  an  individual's  test  score  and  depends  on  the 
individual's  variability,  just  as  an  individual's  threshold  in  psychophysics 
depends  on  the  ambiguity  of  the  stimulus;  as  a  stimulus  is  variable  with  respect 
to  a  group  of  judges,  so  an  individual  is  variable  with  respect  to  a  group  of 
items. 

Mosier likened the ambiguity (discriminal dispersion) of a stimulus in psychophysics
to individual variability in mental test performance. He postulated
the distribution of the proportion of correct answers for one individual across
items of differing difficulty as the integral of the normal probability curve
and the variability of that individual as the standard deviation of the probability
function whose integral is the proportion of correct answers as a function
of difficulty. Mosier applied the constant process of psychophysics to a
set of test data (of unspecified characteristics) and estimated the difficulty
of median error for individuals (ability level) and its dispersion. He found
odd-even reliability of ability level estimated by this method to be .88. The
reliability of the person variability index was .55, a value significantly different
from zero. It was perhaps this apparently low reliability estimate which
was responsible for a complete lack of research on person variability for the
next 30 years.

Weiss's  stradaptive  "trace  line".  The  idea  of  person  variability  within 
one test was independently developed by Weiss (1973) as a by-product of computerized
adaptive testing. In the design of the stratified-adaptive (stradaptive)
test, he ordered ability test items by difficulty levels into strata. In examining
testee performance on stradaptive tests, Weiss noted that individuals who correctly
answered items of the same average difficulty level differed in terms of
the proportion of items they answered correctly at different difficulty levels.

To examine differences in individual variability, Weiss proposed the concept
of a "trace line" for a testee's item responses, with items divided into strata
of increasing difficulty on the x-axis and proportion correct for an individual
on each stratum on the y-axis, duplicating the suggestion of Mosier 30 years
earlier. Weiss hypothesized, as did Mosier, that proportion correct would decrease
as stratum difficulty increased. Also echoing Mosier, he proposed that
the steepness of the slope be interpreted as an index of the consistency of an
individual's item responses and the capability of the item pool to discriminate
an individual's ability level. The point of inflection of the curve, where 50%
of the items were answered correctly (for free-response items), was proposed as
an indicator of the difficulty of the item pool for an individual or the position
of that individual on the trait continuum. To operationalize the concept
of person variability, or what Weiss called "consistency," he suggested calculating
several indices, including the standard deviation of item difficulties
answered correctly and the standard deviation of item difficulties encountered.
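
The two consistency indices named above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the 0/1 response coding, and the choice of the population standard deviation (rather than the sample form) are not specified in the text and are chosen here for simplicity.

```python
from statistics import pstdev

def consistency_indices(difficulties, responses):
    """Return (SD of difficulties encountered, SD of difficulties answered correctly).

    difficulties: difficulty value of each administered item (hypothetical data)
    responses:    1 for a correct answer, 0 for an incorrect one
    Assumption: population SD is used; the source does not say which form applies.
    """
    if len(difficulties) != len(responses):
        raise ValueError("need one response per item")
    answered_correctly = [b for b, u in zip(difficulties, responses) if u == 1]
    sd_encountered = pstdev(difficulties)
    # With no correct answers the second index is undefined.
    sd_correct = pstdev(answered_correctly) if answered_correctly else float("nan")
    return sd_encountered, sd_correct

# Hypothetical testee: four items, the two easiest answered correctly.
sd_enc, sd_cor = consistency_indices([-1.0, 0.0, 1.0, 2.0], [1, 1, 0, 0])
```

Smaller values of either index would indicate that the testee's items, or correct answers, were concentrated in a narrow band of difficulty.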

Vale and Weiss (1975) empirically studied some aspects of individual "consistency"
as part of a larger study of computer-administered adaptive testing.
Included in this study was a test of the hypothesis that more consistent individuals,
those with smaller errors of measurement in Mosier's (1940, 1942) formulation,
would have more stable ability estimates. The five operationalizations
of consistency originally proposed by Weiss (1973) were studied as moderators
in the prediction of test-retest reliability of ability estimates. The standard
deviation of item difficulties encountered significantly moderated the stability
of ability estimates in the expected direction, as, to a lesser extent, did the
standard deviation of item difficulties answered correctly.

In addition, Vale and Weiss (1975) studied the test-retest reliability of
the  "trace  line"  plots  for  individuals  and  introduced  the  new  term  "subject 
characteristic  curve"  for  these  trace  lines.  They  used  canonical  redundancy 
analysis  (Weiss,  1972)  on  the  proportion-correct-within-strata  data  (i.e., 
the subject characteristic curves) in a retest situation. The results indicated
a high degree of predictability of subject characteristic curves on one
test  from  the  test  scores  on  the  other;  redundancies  indicated  from  47%  to  67% 
common  variance  across  the  two  testing  times.  These  results  indicated  a  good 
degree  of  stability  in  the  proportion  of  correct  responses  within  the  strata 
of  the  stradaptive  test  as  indexed  by  the  subject  characteristic  curves. 

Lumsden's subject characteristic curve. The subject characteristic curve
was  again  independently  proposed  by  Lumsden  (1977,  1978)  as  a  derivation  from 
Thurstone's  law  of  categorical  judgment.  Lumsden  proposed  an  attribute-based 
model  of  test  performance  in  which  a  person's  ability  fluctuates  in  trends 
(long-term  developmental  changes),  swells  (short-term  mood  swings),  and  tremors 
(moment-to-moment  shifts).  He  assumed  tremors  to  be  rapid,  random,  and  normally 
distributed  shifts  of  ability  occurring  from  moment  to  moment  within  a  single 
test  situation:  The  discriminal  dispersion  of  item  difficulties  stays  at  zero, 
and it is only person ability that fluctuates. If the momentary location
of a person's ability level is higher than the point location of the item's difficulty,
the person will answer an item correctly. If ability is lower at any
moment than the item difficulty location, the person will answer that item incorrectly.
Lumsden then extended the idea to the plot, for a single person, of
item responses at different difficulty levels, which he called the "person characteristic
curve." He suggested that the person characteristic curve is "perfectly
analogous to the item characteristic curve." Lumsden's basic assumptions,
however, are different from the ICC theory assumptions underlying item characteristic
curves; an ICC assumes that ability level is constant, not fluctuating,
but that the response to a given test item includes a random error component
causing observed item responses to fluctuate around true ability level.
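
Lumsden's tremor assumption, as summarized above, can be illustrated with a short simulation. This is a sketch only: the function names, the tremor standard deviation, and the number of trials are assumptions introduced for illustration. Under normally distributed tremors, the probability that momentary ability exceeds a fixed item difficulty is a normal ogive in (ability minus difficulty), so the simulated proportions should approach that curve.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_proportion(theta, tremor_sd, difficulty):
    """Probability that momentary ability exceeds a fixed item difficulty."""
    return normal_cdf((theta - difficulty) / tremor_sd)

def simulated_proportion(theta, tremor_sd, difficulty, n_trials=20000, seed=0):
    """Draw momentary abilities and count how often they exceed the difficulty."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(theta, tremor_sd) > difficulty for _ in range(n_trials))
    return hits / n_trials

# Hypothetical person with mean ability 0.0 and tremor SD 1.0,
# evaluated at five item difficulty levels.
curve = [(b, expected_proportion(0.0, 1.0, b)) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```

A smaller tremor standard deviation gives a steeper curve, which corresponds to the more consistent person in Lumsden's terms.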

Levine and Rubin's aberrance indices. Other approaches to the study of
intra-individual variability within a test have concentrated on the use of
intra-individual variability for test validation rather than on individual ability
assessment. Levine and Rubin (1976) and Levine (1979) initiated several
studies concerned with individuals or groups of individuals for whom a given
test might be invalid and/or inappropriate. Among the populations of concern
were those who obtain higher scores because of cheating and those who obtain
lower scores because of lack of proficiency in English. Levine and Rubin developed
several types of "aberrance indices" to identify, at a greater than chance
level and without reference to demographic data, examinees for whom a given test
would be inappropriate.

Their  basic  assumption  was  that  an  aberrant  examinee's  response  pattern 
to  items  of  varying  difficulty  should  have  a  low  marginal  probability,  since 
it  is  unlikely  that  a  high-ability  examinee  would  incorrectly  answer  an  easy 
item or a low-ability examinee correctly answer a difficult item. Marginal
probability was operationally defined as the average of the conditional probabilities
of a correct response on each item of difficulty level b for an individual
of ability level θ. If n = the number of items, there are 2^n marginal
probabilities (one for each possible response pattern). These were ranked, with
all probabilities below an arbitrary
cutoff point considered to represent aberrant response patterns.

Using a Monte Carlo simulation with 3,000 hypothetical examinees, 200 of
whom were aberrant responders, Levine and Rubin (1976) conducted several studies
at different cutoff points on the marginal probabilities to determine if aberrant
examinees could be identified at a rate significantly greater than chance.
Receiver operating characteristic (ROC) curves from signal detection theory were
used to evaluate the performance of their experimental methods of identifying aberrance.
The best method identified 50% of the spuriously low and 80% of the spuriously
high examinees, while mistaking only 10% of the normal examinees as aberrant.

When  compared  to  the  chance  level  predictions  of  only  10%  of  spuriously  high 
or  low  examinees  identified  while  mistaking  10%  of  the  normal  examinees,  this  study 
seemed  to  have  yielded  results  that  merit  further  study.  However,  a  closer  look 
reveals the impracticality of Levine and Rubin's best method. Even if the aberrance
indices identified 80% of the aberrant examinees (160 out of 200) and
misclassified only 10% of the non-aberrant examinees (280 out of 2,800), this would
still result in eliminating as invalid the test results of 280 non-aberrant examinees.
Levine and Rubin seemed to completely ignore this problem in their
paper.
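
The arithmetic behind this criticism can be made explicit. The sketch below uses only the figures quoted above (3,000 examinees, 200 aberrant, an 80% identification rate, a 10% misclassification rate); the function name and the derived proportion of flagged examinees who are truly aberrant are additions for illustration and do not appear in Levine and Rubin's report.

```python
def screening_outcomes(n_total, n_aberrant, hit_rate, false_alarm_rate):
    """Count correct and incorrect flags for a given identification rule."""
    n_normal = n_total - n_aberrant
    true_flags = hit_rate * n_aberrant          # aberrant examinees correctly flagged
    false_flags = false_alarm_rate * n_normal   # normal examinees wrongly flagged
    flagged = true_flags + false_flags
    return {
        "true_flags": true_flags,
        "false_flags": false_flags,
        "flagged": flagged,
        "share_of_flags_truly_aberrant": true_flags / flagged,
    }

result = screening_outcomes(3000, 200, hit_rate=0.80, false_alarm_rate=0.10)
```

With these figures, 160 aberrant and 280 non-aberrant examinees are flagged, so only 160 of the 440 flagged records (about 36%) belong to examinees who are actually aberrant.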

Wright's residual analysis. Wright's (1977; Wright & Mead, 1977; Wright &
Stone, 1979) concern with intra-individual variability in a single situation focuses
on the interaction of a person with specific test items. Wright has developed
methods for identifying items which may be invalid for a certain person or
group of persons and which can then be excluded from consideration when calculating
ability estimates from those items. Wright (1977; Wright & Stone, 1979,
pp. 165-180) cited tendencies such as guessing, cheating, "sleeping" (getting
bored with a test and answering later items in a more haphazard fashion), "fumbling"
(e.g., answering earlier items with difficulty because of confusion with
test format), and cultural bias. Wright's method (Wright & Stone, 1979; Mead,
in prep.) utilizes standardized squares of the residuals between an item's
difficulty  level  and  a  person's  ability  level  after  fitting  the  one-parameter 
logistic  model  to  the  test  data.  If  these  residuals  indicate  a  significantly 
low  probability  of  responding  correctly  or  incorrectly  and  the  person  responded 
in  that  way,  the  tester  then  has  reason  to  suspect  that  the  item  or  item  set 
may  be  invalid  for  that  particular  person. 
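
A squared standardized residual of this kind can be sketched as follows. The sketch assumes the usual Rasch form, in which the residual is the observed response minus the model probability, divided by the binomial standard deviation; the function names and the logit scaling (no 1.7 constant) are assumptions for illustration and are not taken from Wright's programs.

```python
import math

def rasch_probability(theta, b):
    """Probability of a correct response under the one-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def squared_standardized_residual(response, theta, b):
    """Squared standardized residual for one person-item encounter (response is 0 or 1)."""
    p = rasch_probability(theta, b)
    z = (response - p) / math.sqrt(p * (1.0 - p))
    return z * z

# Hypothetical person with ability 0.0 who answers a hard item (b = 2.0) correctly.
surprise = squared_standardized_residual(1, theta=0.0, b=2.0)
```

Large values mark responses that are improbable given the fitted ability and difficulty, such as a correct answer to an item far above the person's estimated level.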

This  approach  is  consistent  with  Wright's  use  of  the  one-parameter  Rasch 
model, which recognizes only a difficulty level of items but not a discrimination
parameter or a guessing parameter. Following the assumptions of this model,
Wright  maintained  that  the  probability  of  success  on  more  difficult  items  should 
always  be  less  than  on  easier  items  no  matter  who  attempts  the  items,  so 
the  test  developer  must  prevent  variation  in  item  discrimination  sufficient  to 
produce  item  characteristic  curves  that  cross.  Also,  following  this  logic,  a 
higher  ability  person  should  have  a  better  chance  for  success  no  matter  what 
the  difficulty  of  the  item  attempted,  so  the  test  developer  must  prevent 
variation  in  person  sensitivity;  the  result  is  that  person  characteristic  curves 
must not cross each other. Wright claimed that the practical problem of variation
in item discrimination and person sensitivity can be treated through supervision
rather than estimation, using residuals and deleting inappropriate items
from a person's responses without interfering with estimates of a person's ability.
Wright's method seems to oversimplify response data by ignoring the effects
of  item  discrimination  and  guessing,  as  well  as  precluding  the  possibility  of 
more  subtle  diagnoses  of  added  dimensions  acting  as  moderator  variables  in  the 
testing  situation. 

Summary and Objectives

The  limited  literature  on  person  variability  within  a  test  thus  seems  to 
have three major trends: 1) the direct analysis of person variability as originally
suggested by Mosier, later called the testee's trace line by Weiss, the
subject characteristic curve by Vale and Weiss, and the person characteristic
curve by Lumsden (1977); 2) designation of highly variable persons as aberrant
by  Levine  and  Rubin;  and  3)  the  elimination  of  aberrant  person-item  interactions 
by  Wright.  Careful  analysis  of  these  three  approaches  indicates  that  the  first 
approach  (that  of  the  person  characteristic  curve)  is  the  most  general  of  the 
three,  subsuming  the  other  two  as  special  cases:  If  the  entire  pattern  of  a 
testee's  responses  is  studied  as  a  function  of  difficulty  level  of  the  items, 
the  identification  of  aberrant  response  patterns  or  person-item  restrictions 
follows  directly.  In  addition,  postulating  a  person  characteristic  curve  in 
conjunction with ICC theory provides a means of testing, for single individuals,
whether their response patterns fit the theory, regardless of the number of
parameters assumed.

The  purpose  of  this  study  was  to  further  explore  the  Mosier-Weiss-Lumsden 
idea  of  the  person  characteristic  curve,  to  determine  its  utility  as  a  means  of 
describing  testee  response  variability,  and  to  study  the  fit  of  individuals  to 
the  ICC  model.  To  emphasize  that  the  curve  is  derived  from  the  responses  of  an 
individual  to  a  set  of  test  items,  it  was  renamed  the  "person  response  curve." 

The  focus  of  this  research  is  on  the  investigation  of  the  reliability  and  other 
psychometric  characteristics  of  the  person  response  curve. 

Figure 1 is a plot of person response curves (PRCs) for each of three hypothetical
testees. To obtain these plots, a number of items of different difficulty
levels are administered to a testee. For each difficulty level, the
proportion of items answered correctly is plotted as a function of difficulty
level. The resulting PRC is representative of one person's performance on one
test.


Figure 1

Observed Person Response Curves for Three Hypothetical
Persons with the Same Ability Level (θ=0.0)

[Figure not reproduced; horizontal axis in standard scores, from low ability to high ability]


Figure  1  shows  the  PRC  plots  of  three  different  persons — A,  B,  and  C — who 
have  all  obtained  the  same  score  on  the  test  by  answering  50%  of  the  total  test 
questions  correctly.  Thus,  all  the  curves  cross  at  the  point  on  the  vertical 
axis  of  .50,  and  their  average  proportion  correct  across  all  item  difficulty 
levels is .50. The center point of the curve can then be projected downward to
the horizontal axis to obtain an ability level estimate (θ) of 0.0, which in
standard score terms is at the mean of a population. Yet, Figure 1 illustrates
that although these three persons all achieved the same total score on this
test, they obtained that score in substantially different ways.

As  shown  in  Figure  1,  the  three  testees — A,  B,  and  C — differ  in  a  number 
of  variables.  Note  that  the  curve  for  Person  A  has  a  substantially  steeper 
slope around its center point than do those for Persons B and C. This shows
that with this particular item pool, Person A was measured more precisely than
either Person B or C, or (in Mosier's, 1942, terms) that the error of measurement
for Person A was smaller. Thus, in addition to ability level scores, information
on individual precision of measurement is derivable from the PRC.

The third type of information derivable from the study of PRCs is a person's
guessing behavior. This is shown in Figure 1 as the lower right-hand
portion of the curve for each testee. Note that Persons B and C answered
very difficult items correctly at a nonzero rate. It may, therefore, be hypothesized
that they were guessing. However, Person A answered none of the difficult
items correctly. It may be hypothesized that this testee, unlike the other two,
was not guessing.

A fourth type of information possibly derivable from the PRC is a carelessness
index, shown in the upper left-hand corner of Figure 1. Persons B and C
answered only about 80% of a set of very easy items correctly, even though
their ability levels were considerably higher than the difficulty of those items.
On the other hand, Person A answered the same items all correctly, as would be
expected for a person with a
relatively high ability level. Thus, it could be hypothesized that Persons B
and C were more careless than Person A.

Finally,  the  fifth  kind  of  potential  information  derivable  from  a  study 
of  PRCs  is  shown  for  Person  B  and  is  a  deviation  from  a  unidimensional  response 
pattern,  as  suggested  by  Mosier  (1940,  p.  364).  That  is,  the  test  performance 
of  Person  B  shows  that  he/she  was  answering  correctly  beyond  the  chance  level 
some  difficult  items  which  were  beyond  his/her  ability  level.  Since  such  test 
response  behavior  is  inconsistent  with  a  unidimensional  hypothesis,  there  may 
be,  for  this  individual,  some  dimension  accounting  for  test  performance  other 
than  the  one  being  measured  by  the  test  for  other  persons. 

Thus, the PRC provides the potential for considerable additional information
from an individual's test response record. All that is required to obtain
an observed PRC is 1) to administer to an individual a number of items of varying
difficulty levels, 2) to determine the proportion of items answered correctly
at each difficulty level, and 3) to plot those proportions as a function of
item difficulty level.
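
The first two of these steps can be sketched in code; the plotting step is omitted. The sketch is illustrative only: the function name, the grouping of items into strata by fixed cut points, and the example data are assumptions, since the text does not specify how difficulty levels are formed.

```python
from bisect import bisect_right

def observed_prc(difficulties, responses, cut_points):
    """Return the proportion correct in each difficulty stratum.

    difficulties: difficulty value of each administered item
    responses:    1 for a correct answer, 0 for an incorrect one
    cut_points:   ascending boundaries; k boundaries define k + 1 strata
    """
    if len(difficulties) != len(responses):
        raise ValueError("need one response per item")
    n_strata = len(cut_points) + 1
    attempted = [0] * n_strata
    correct = [0] * n_strata
    for b, u in zip(difficulties, responses):
        stratum = bisect_right(cut_points, b)  # index of the stratum containing b
        attempted[stratum] += 1
        correct[stratum] += u
    # None marks a stratum in which no items were administered.
    return [c / n if n else None for c, n in zip(correct, attempted)]

# Hypothetical testee: nine items in three difficulty clusters.
difficulties = [-2.2, -2.0, -1.8, -0.1, 0.0, 0.2, 1.9, 2.0, 2.3]
responses = [1, 1, 1, 1, 0, 1, 0, 0, 1]
prc = observed_prc(difficulties, responses, cut_points=[-1.0, 1.0])
```

Plotting the returned proportions against the stratum difficulties would give the observed PRC for this testee.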

Expected Person Response Curves

Although the observed PRCs are useful in describing a person's test behavior,
by themselves they provide no means of determining whether observed fluctuations
in the curve represent important characteristics of the individual or
merely chance deviations. ICC theory, however, permits the derivation of
expected PRCs, which can then be used to evaluate whether aspects of the observed
PRCs are real or chance fluctuations. In addition, comparing the observed PRCs
with these expected PRCs permits testing the fit of individual persons to the
ICC model for a given set of test item responses.

Expected PRCs are derivable from the one-, two-, or three-parameter
ICC models. Derivation of the expected PRC requires an ability estimate, θ,
and the item parameters for all the items administered. Generally, the ICC item
parameters of the items administered will have been estimated in advance by a
method such as Lord's LOGIST (Wood, Wingersky, & Lord, 1978) or one of Urry's
(e.g., Schmidt & Urry, 1976) estimation procedures; the difficulty (b) parameters
will have been used to order the items by difficulty level to obtain the observed
PRC. Estimates of ability level (θ) may be obtained using programs described by
Bejar and Weiss (1979).


In the case of the three-parameter logistic ICC model, the expected
probability of a correct response for any given test item, P_g(θ̂), is given as a
function of θ̂, a_g, b_g, and c_g by the three-parameter logistic equation:

$$P_g(\hat{\theta}) = c_g + (1 - c_g)\,\frac{1}{1 + \exp[-D a_g(\hat{\theta} - b_g)]} \qquad [1]$$

where

θ̂ is the person's estimated ability score;
g is an item;
a_g is the ICC item discrimination parameter;
b_g is the ICC item difficulty parameter;
c_g is the ICC item lower asymptote ("guessing") parameter; and
D is equal to 1.7.


If a two-parameter ICC model is used, the terms in Equation 1 with c are
deleted; if the one-parameter (Rasch) model is used, the a values are set to 1.0.
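
As an illustration of Equation 1, the following minimal Python sketch computes
the model probability for one item. The function name `p_correct` and the item
parameter values are hypothetical, chosen only for the example; the lower
asymptote defaults to zero, which corresponds to dropping the guessing terms.

```python
import math

D = 1.7  # scaling constant stated in the text


def p_correct(theta, a, b, c=0.0):
    """Three-parameter logistic probability of a correct response.

    theta: ability estimate; a: discrimination; b: difficulty;
    c: lower asymptote ("guessing"). With c = 0 the guessing terms vanish,
    and with a = 1.0 as well the one-parameter form is obtained.
    """
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return c + (1.0 - c) * logistic


# Hypothetical item: a = 1.0, b = 0.0, c = 0.20, answered by a person with theta = 1.0
p_example = p_correct(1.0, a=1.0, b=0.0, c=0.20)
print(round(p_example, 3))
```

When ability equals item difficulty, the probability is halfway between the
lower asymptote and 1.0, which is a convenient sanity check on any
implementation.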

Using the estimated probability of a correct response for each item resulting
from Equation 1, an expected PRC can be plotted. This is illustrated in
Figure 2. Figure 2a illustrates three-parameter ICCs for nine test items,
grouped at three levels of difficulty. Difficulties of Items 1, 2, and 3 are
relatively low, between -2.0 and -2.5; Items 4, 5, and 6 are clustered around a
difficulty of b=0.0; and Items 7, 8, and 9 are the most difficult set, with
b=+2.0. The dashed vertical line in Figure 2a represents a person with θ̂=1.0.

The estimated probability of a correct response to each item, resulting
from Equation 1, is shown in Figure 2a by the dotted horizontal line extending
from the ICC to the vertical axis at θ̂=1.0. Thus, for Items 1 and 2, the
probability of a correct response is essentially 1.0; and for Item 3, about .98.
For Items 4, 5, and 6 the probabilities are .80, .82, and .85, respectively;
and for Items 7, 8, and 9, P = .08, .10, and .22. These nine probabilities are
plotted in Figure 2b and constitute an expected PRC for a person with θ̂=1.0,
with the probability for each item plotted at its difficulty level. It will be
noted that for Item Groups 4, 5, 6 and 7, 8, 9 in Figure 2a, the expected
proportions correct are not monotonically decreasing as might be expected from
theoretical considerations. This is due to the differing discriminations of
the items (as illustrated in Figure 2a). Thus, to construct an estimated PRC,
it might be desirable to plot a smoothed curve around the values plotted in
Figure 2b.


Figure 2

Estimating the Expected Person Response Curve (PRC) for a Person
with θ̂=1.0 Using Nine Test Items

(a) Three-Parameter Item Characteristic Curves
Grouped at Three Levels of Difficulty

(b) Expected Person Response Curve (PRC)
for a Person with θ̂=1.0

[Vertical axis: Probability of Correct Response, P(θ)]

One way of smoothing expected PRCs is to average the probabilities of
a correct response to items close in difficulty level. Since the observed PRC
utilizes the proportion of correct responses to a set of items of similar
difficulty, averaging the probabilities of correct responses in the expected
PRC facilitates the direct comparison of observed and expected PRCs. Lord
has referred to

$$\xi = \sum_{g=1}^{k} P_g(\hat{\theta}) \qquad [2]$$

as the expected true score on a set of test items, where k is the number of
items for which the expected probability of a correct response has been computed
from Equation 1 and ξ is the expected number of correct responses in k items.

An estimate of the proportion of correct responses on a subset of items is

$$p_s = \xi / k = \sum_{g=1}^{k} P_g(\hat{\theta}) / k \qquad [3]$$

or the average proportion correct on the k-item subset. Values of p_s, the
expected proportion correct on the three subsets of items in Figure 2a, are shown
by X's in Figure 2b. Connecting these values with a curve gives the expected
PRC based on the three-parameter logistic ICC model, which for any individual
is directly comparable to his/her observed PRC.


The expected PRC is therefore simply a function of θ̂ and the item
parameters. Thus, for a given θ̂ and a given set of items, the expected values of the
PRC will be constant. The observed PRC, on the other hand, results from the
interaction of an individual with the items. If an individual answers the set
of test items strictly in accordance with the ICC model, the observed PRC should
conform to the expected PRC. If an individual's test item responses are
determined by factors other than a single unidimensional trait, deviations of the
observed PRC from the expected PRC will appear.

Observed  versus  Expected  PRCs 

Figure 3 shows hypothetical observed and expected PRCs for an individual
with θ̂=0.0. The observed PRC (solid line) is plotted from data on test items
grouped at seven points on the item difficulty continuum: b=±3, ±2, ±1, and 0.
The expected PRC data points (dashed line) were derived from Equations 1 and 3
for the test items administered, using the same item difficulty groupings. To
determine whether a person's carelessness, guessing, dimensionality, or
precision are significantly different from those predicted by the model, an
expected PRC may be determined for any person on any set of test items with
estimated ICC parameters, and the observed PRC may be compared to it. If the
observed PRC differs from the expected model-based prediction in any respect,
the observed PRC describes a significant aspect of the person's testing behavior.
Once quantified, these person-fit variables might then be usable in prediction
situations to increase the accuracy of predictions made from test scores. This
could be done by including additional information on guessing, carelessness,
precision, and dimensionality and on other aspects of a person's test
performance as reflected in the relationship of observed and expected PRCs.


Figure  3 

Observed and Expected Person Response Curves for a Person with θ̂=0.0

Method

The  following  data  analyses  constitute  a  first  examination  of  observed  PRCs 
and  their  relationships  with  expected  PRCs  for  a  group  of  individuals  on  a  test 
designed  to  permit  study  of  the  characteristics  of  PRCs.  The  major  analyses 
were  directed  at  establishing  the  reliability  of  observed  PRCs  and  the  fit  of 
observed  and  expected  PRCs.  Some  correlates  of  person-fit  indices  derived  from 
the  PRC  were  also  investigated. 


Subjects 


Subjects  were  151  undergraduate  students  in  the  introductory  psychology 
course  at  the  University  of  Minnesota.  These  students  volunteered  for  the  study 
in  return  for  bonus  points  that  would  count  toward  their  final  grade.  Students 
were  given  a  posttest  debriefing,  which  consisted  of  a  brief  explanation  of  the 
purpose of the study. No test results were given, due to the lengthy procedures
for keypunching and scoring the data.

Test Instrument


The test consisted of 216 five-option multiple-choice vocabulary items.
The items were chosen from a preexisting item pool of over 500 items with ICC
difficulty and discrimination parameters that had been developed on a similar
population of undergraduates in the introductory psychology course in previous
years (McBride & Weiss, 1974). The 216 items were selected for high
discriminating power and for spread of difficulty (c parameters were set at .20 for
all items).

The  test  was  given  as  a  paper-and-pencil  test  without  time  limits.  Items 
were  randomly  ordered  for  administration  so  that  easy  and  difficult  items  were 
spread  throughout  the  test.  In  addition,  to  control  for  any  effects  of  item 
order,  the  pages  of  test  questions  were  ordered  in  six  different  ways  so  that 
only  one-sixth  of  the  students  took  the  test  in  the  same  page  order. 

Observed  PRCs 


Stratifying the test. In order to transform student response data into
observed PRCs, test items were divided into strata containing an equal number
of items, with each stratum representing a different level of difficulty. This
was done by reordering the items by difficulty level (b parameter), then
dividing them into nine separate groups (or strata) of 24 items each. In this way,
Stratum 1 contained the 24 easiest items and Stratum 9 contained the 24 most
difficult items.

Items were then ordered within each stratum by discrimination (a) level,
with the most discriminating item the first item in the stratum and the least
discriminating item the 24th item in the stratum. To investigate the parallel
forms reliability of observed PRCs, each stratum was then split into two
parallel substrata of items with similar difficulty and discrimination
parameters. This provided 18 substrata of 12 items each. Item difficulty and
discrimination parameters for all items by stratum and substratum are in
Appendix Table A.

Items  were  scored  as  either  correct  ("1")  or  incorrect  ("0"),  with  omitted 
items  scored  as  incorrect.  The  correct-incorrect  response  vectors  were  then 
reordered  by  item  difficulty  level  for  each  student.  The  proportion  of  correct 
responses  was  then  computed  on  each  of  the  nine  strata  and  on  each  of  the  18 
substrata  for  each  student,  providing  information  for  observed  PRCs  based  on 
all  216  items  (i.e.,  nine  24-item  subtests  of  differing  difficulty  levels)  and 
split-half  parallel  observed  PRCs,  each  based  on  nine  12-item  subtests. 
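
The stratification and scoring steps described above can be sketched as
follows. This is an illustrative Python sketch under stated assumptions: the
function names are hypothetical, the item parameters are synthetic, and the
rule used to form the two parallel substrata (alternating items after ranking
by discrimination) is an assumption, since the text does not specify how the
split was made.

```python
def stratify(items, n_strata=9):
    """Split items into equal-sized difficulty strata, easiest first.

    items: list of (item_id, a, b). Within each stratum, items are ordered
    from most to least discriminating.
    """
    by_difficulty = sorted(items, key=lambda item: item[2])
    size = len(by_difficulty) // n_strata
    strata = [by_difficulty[i * size:(i + 1) * size] for i in range(n_strata)]
    return [sorted(s, key=lambda item: item[1], reverse=True) for s in strata]


def split_parallel(stratum):
    """Assumed splitting rule: alternate ranked items between substrata A and B."""
    return stratum[0::2], stratum[1::2]


def observed_prc(responses, strata):
    """Proportion correct per stratum; omitted items are scored as incorrect.

    responses: dict mapping item_id to 1 (correct) or 0 (incorrect).
    """
    return [sum(responses.get(item[0], 0) for item in s) / len(s) for s in strata]


# Synthetic pool of 216 items with evenly spaced difficulties from -3 to +3
items = [(i, 0.5 + (i * 37 % 11) * 0.1, -3.0 + 6.0 * i / 215) for i in range(216)]
strata = stratify(items)
substrata = [split_parallel(s) for s in strata]

# Synthetic examinee who answers every item with b < 0 correctly
responses = {item_id: (1 if b < 0 else 0) for item_id, a, b in items}
observed = observed_prc(responses, strata)
print(observed)
```

The same `observed_prc` function can be applied to the A and B substrata
separately to obtain the two split-half observed PRCs.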

To examine the characteristics of the items constituting the strata,
internal consistency reliability of each of the nine strata was determined using
Cronbach's alpha. Parallel forms reliability of the nine pairs of parallel
substrata was determined by the product-moment correlation coefficient between
proportion-correct scores on each of the nine pairs of substrata.
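
Both reliability indices are standard statistics; a minimal Python sketch of
each is given below. The function names and the tiny data sets are
hypothetical and serve only to show the computations.

```python
from statistics import mean, pvariance


def cronbach_alpha(score_matrix):
    """Cronbach's alpha for a persons-by-items matrix of 0/1 item scores."""
    k = len(score_matrix[0])
    item_variances = [pvariance([row[j] for row in score_matrix]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in score_matrix])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)


def pearson_r(x, y):
    """Product-moment correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical data: four persons, three perfectly consistent items
alpha = cronbach_alpha([[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]])

# Hypothetical proportion-correct scores on substrata A and B for five persons
r_ab = pearson_r([0.92, 0.75, 0.58, 0.83, 0.42], [0.83, 0.75, 0.50, 0.92, 0.33])
print(round(alpha, 2), round(r_ab, 2))
```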

Estimated  PRCs 


Using Program LINDSCO (Bejar & Weiss, 1979), Owen's Bayesian ability
estimation procedure was used to compute ability estimates (θ̂) for each student
based on his/her responses to all 216 items in the test. This θ̂ was then used
in Equations 1 and 3, in conjunction with the item parameters for the 24 items
in each stratum, to obtain the expected proportion-correct score in each of the
nine strata (p_s). The p_s values then constituted the expected PRC for each
student, assuming the three-parameter ICC model. This process was repeated for
each of the parallel substrata, yielding expected PRCs for each student from each
of the two 108-item parallel pools.

Correlates  of  Observed  PRCs 

In addition to the vocabulary items, 11 five-alternative Likert-type
questions were used to assess psychological variables hypothesized to be related to
PRC data. These questions were taken from psychological reactions scales
developed by Betz and Weiss (1976), with some slight modifications. Four items were
used in a Perceived Test Difficulty scale, four in a Test-Taking Anxiety scale,
and three items in a Test-Taking Motivation scale.

The psychological reactions scale items (shown in Appendix Table B) were
scored "1" through "5," with the first response alternative for each item scored
as "1" and each succeeding alternative scored a point higher. Item scores were
weighted positively or negatively (see Table B), according to how they were keyed
on the psychological reactions scale. The total number of item score points
ranged from +8 to -8 on the Perceived Test Difficulty and the Test-Taking Anxiety
scales, and from +9 to -3 on the Test-Taking Motivation scale.

Reliability  of  Observed  PRCs 

Within- and between-persons D² indices. To determine the split-half parallel
forms reliability of observed PRCs, a D² statistic was computed for each student,
comparing his/her observed PRC data (proportion correct) on each of the paired
substrata; thus, D² indexed the similarity of the two split-half PRCs for each
student. A D² value of zero would indicate that the two split-half PRCs were
identical; large values would indicate differences between the two PRCs.
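
A minimal Python sketch of the index follows, assuming the usual profile
distance of Cronbach and Gleser (1953), that is, the unweighted sum of squared
differences between corresponding profile points. The function name and the
two nine-point profiles are hypothetical.

```python
def d_squared(profile_x, profile_y):
    """Sum of squared differences between two equal-length profiles."""
    if len(profile_x) != len(profile_y):
        raise ValueError("profiles must have the same number of points")
    return sum((x - y) ** 2 for x, y in zip(profile_x, profile_y))


# Hypothetical split-half observed PRCs (proportion correct in nine substrata)
prc_a = [0.92, 0.83, 0.75, 0.67, 0.58, 0.50, 0.42, 0.33, 0.25]
prc_b = [1.00, 0.75, 0.75, 0.58, 0.67, 0.42, 0.33, 0.33, 0.33]
within_d2 = d_squared(prc_a, prc_b)
print(round(within_d2, 3))
```

The same function serves for within-persons comparisons (one person's A and B
profiles) and for between-persons comparisons (profiles from two different
people).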

Although the D² statistic is a commonly used descriptive statistic for
comparing profiles (Cronbach & Gleser, 1953), no sampling distribution is available
for it. To provide a basis of comparison for the within-persons (split-half)
reliability D², four sets of between-persons D² statistics were computed.
Students were paired randomly into 75 pairs. The first D² statistic [D(AA)] was
obtained by comparing the observed PRC data for one of the split-half PRCs
(arbitrarily designated "A") of each student with those of his/her randomly
paired student. The second D² statistic [D(BB)] was obtained by comparing the
same pairs on their observed PRC data from their other (Subset B) substrata.
The third and fourth D² statistics [D(AB) and D(BA)] were obtained by comparing
one student's first split-half PRC with the other student's second split-half
observed PRC.

Group means and standard deviations were then computed for the four
between-persons D² indices [D(AA), D(BB), D(AB), and D(BA)] and the one
within-persons D². Fisher's t test for differences in means was used to determine
whether the two split-half observed PRCs of the same person were more similar to
each other than were the observed PRCs of randomly paired individuals. If
observed PRC data were reliable, it would be expected that profiles
within persons would have significantly lower mean D² values than profiles
between persons, especially considering differences in ability level between
randomly paired individuals.


Chi-square tests of independence. A second approach to the study of the
reliability of observed PRCs used a chi-square test of independence. For each
student, the 2 × 9 contingency table included the number of correct responses
on each of the parallel substrata in each of the rows of the 9-column table.
Chi-square tests of independence were computed separately for each student.

If the paired substrata were parallel, a nonsignificant value of chi-square
would be supportive of the reliability of observed PRC data. Although this
chi-square test violated the usual assumption of independence because the cell
frequencies were based on the same student's responses to all the questions,
it may be argued that the students' test item responses are locally independent
(i.e., are independent for a given student who has a fixed value of θ) and,
therefore, that the test is not inappropriate. Further study of this problem
is necessary, however, in future applications of this index.
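
For one student, the test can be sketched in Python as below. This is an
illustrative implementation of the ordinary Pearson chi-square for a
contingency table; the function name and the number-correct counts are
hypothetical. A 2 × 9 table has (2 - 1)(9 - 1) = 8 degrees of freedom.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table.

    table: list of rows of observed counts. Every column total must be
    nonzero, otherwise an expected count of zero causes a division error.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical number-correct scores (out of 12) on the nine A and B substrata
substrata_a = [11, 10, 9, 8, 7, 6, 5, 4, 3]
substrata_b = [12, 10, 8, 8, 6, 6, 4, 4, 4]
chi2, df = chi_square_independence([substrata_a, substrata_b])
print(round(chi2, 2), df)
```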

PRCs  and  Person-Fit 


Observed versus expected PRCs. Expected PRCs were determined for each
student using the method described above. To determine if students' responses
to these ability test items were consistent with the three-parameter ICC model,
a chi-square goodness-of-fit statistic was computed between each student's
observed and expected PRC data across the nine strata. If the PRC is an
adequate index of model fit, the mean chi-square for the group would be
nonsignificant. On an individual level, at the .05 level of significance, chi-square
goodness-of-fit values should be statistically significant for 7.55 of the 151
students by chance alone, assuming the null hypothesis of no significant
deviations from person-fit. More significant chi-square values would indicate
a tendency for lack of fit in these data.
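
The text does not give the exact form of the goodness-of-fit statistic. The
Python sketch below therefore rests on an assumption: it converts observed and
expected proportions correct into number-correct counts per stratum and applies
the familiar Pearson form, the sum of (observed - expected)² / expected. The
function name and the profiles are hypothetical.

```python
def chi_square_fit(observed_prc, expected_prc, items_per_stratum=24):
    """Pearson-type fit statistic between observed and expected PRCs.

    Assumption: counts of correct responses per stratum are compared;
    this may differ from the formula actually used in the study.
    """
    statistic = 0.0
    for observed_p, expected_p in zip(observed_prc, expected_prc):
        observed_count = observed_p * items_per_stratum
        expected_count = expected_p * items_per_stratum
        statistic += (observed_count - expected_count) ** 2 / expected_count
    return statistic


# Hypothetical observed and model-expected proportions correct in nine strata
observed = [0.96, 0.88, 0.79, 0.63, 0.58, 0.42, 0.46, 0.29, 0.33]
expected = [0.97, 0.91, 0.82, 0.70, 0.57, 0.46, 0.37, 0.30, 0.26]
fit = chi_square_fit(observed, expected)

# Number of students expected to reach significance by chance at the .05 level
expected_by_chance = 0.05 * 151
print(round(fit, 2), round(expected_by_chance, 2))
```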

When the overall level of fit in the data is substantially different from
the chance expectation, it is still difficult to conclude from the overall
goodness-of-fit tests that a specific individual exhibited reliable and
meaningful lack of fit to the ICC model, since a certain number of such deviations
from fit will occur by chance alone. To identify such individuals, two
separate goodness-of-fit tests were conducted for each student using their observed
and expected PRC data on each of the parallel substrata. This yielded two
chi-square model fit statistics for each student, one for each of the two sets
of substrata. Assuming that the two chi-square values were independent,
reliable person-non-fit would be indicated by identifying persons with
significant (p < .05) chi-square values for each of the substrata tests of
independence; the probability of observing such a result by chance alone would be
.05 × .05, or .0025.
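
Under the stated independence assumption, the chance expectation for this
sample can be worked out directly. The count below is an illustrative
extension of the figures in the text, not a reported result:

$$.05 \times .05 = .0025, \qquad 151 \times .0025 \approx 0.38$$

That is, fewer than one of the 151 students would be expected to show
significant lack of fit on both sets of substrata by chance alone.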

PRCs and ability level. If the responses of most persons fit the ICC
model, the observed PRC should be a function of ability level (θ̂), just as
the expected PRC is a function of ability level. To investigate this
possibility, a variation of the D² reliability analysis was used. Based on
observed PRC data within substrata, students were first matched on ability
level (θ̂) before the between-persons D² measures were computed. These mean
θ̂-matched between-persons D² values were then compared to the within-persons
D² values, on the hypothesis that there should be little difference between
these means (and considerably less difference than when persons were matched
without regard to θ̂) if observed PRCs were primarily a function of ability
level.
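
The matching procedure itself is not described in the text. One simple
possibility, shown here purely as an assumption, is to sort students by their
ability estimates and pair adjacent students; the function name and the
ability values are hypothetical.

```python
def match_by_ability(thetas):
    """Pair students with adjacent ability estimates (assumed matching rule).

    thetas: list of ability estimates, one per student.
    Returns a list of (index, index) pairs; with an odd number of students
    the highest-ability student is left unpaired.
    """
    order = sorted(range(len(thetas)), key=lambda i: thetas[i])
    return [(order[i], order[i + 1]) for i in range(0, len(order) - 1, 2)]


# Hypothetical ability estimates for five students
thetas = [0.5, -1.0, 0.4, -0.9, 2.0]
pairs = match_by_ability(thetas)
print(pairs)
```

Between-persons D² values would then be computed for each matched pair and
averaged for comparison with the within-persons D².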

Correlates  of  Observed  PRCs 

Pearson product-moment correlations were computed among scores on the
three psychological reactions scales, the within-persons D², and the overall
person-fit chi-square. Assuming the validity of the psychological reactions
scales, it would be expected that both the D² and chi-square values would
correlate positively with Perceived Test Difficulty and Test-Taking Anxiety.
Chi-square and D² values were also correlated with ability estimates (θ̂).

Results 


Test  Characteristics 


Table 1 shows the means, standard deviations, and range of item
difficulties (b) and proportion-correct scores (p) in each of the nine strata, and the
values of Cronbach's alpha internal consistency coefficient for each of the
24-item strata. The strata contained items of steadily increasing difficulty:
Stratum 1 contained the easiest items and Stratum 9 contained the most
difficult items. This distribution of items was mirrored in the
proportion-correct data for each stratum. The average proportion correct decreased as
difficulty level of items increased. An exception to this tendency occurred
for Strata 8 and 9, in which average proportion correct was very similar.
Although average proportion correct was related to the item difficulties in
accordance with expectations, the data on the range of individual
proportion-correct scores show considerable variability in proportion correct
within each of the nine strata. The largest range of proportion correct
was in Stratum 4, where at least one student answered only .04 of the items
correctly and the maximum observed proportion correct was 1.0. The
smallest range of observed proportion correct was for Stratum 9, in which the
minimum proportion-correct score was .04 and the maximum was .79. These
data suggest a wide range of individual differences in the
proportion-correct scores for each stratum and consequently the potential for individual
differences in observed PRCs.


Table 1

Mean, Standard Deviation, and Range of Item Difficulties (b)
and Proportion-Correct Scores (p), and Cronbach's Alpha
Coefficient for Each of the Nine Strata

             Item Difficulties (b)           Proportion Correct (p)
          ----------------------------    ----------------------------
Stratum    Mean    SD     Min     Max      Mean    SD     Min     Max     Alpha
   1      -2.40   .31   -2.97   -1.97      .866   .146   .130   1.000      .82
   2      -1.55   .20   -1.93   -1.25      .794   .186   .210   1.000      .85
   3      -1.01   .14   -1.24    -.77      .713   .216   .170   1.000      .86
   4       -.56   .13    -.76    -.37      .615   .202   .040   1.000      .80
   5       -.15   .11    -.36     .01      .545   .209   .080   1.000      .81
   6        .26   .13     .06     .47      .481   .210   .040    .960      .80
   7        .75   .19     .51    1.12      .416   .197   .080    .960      .78
   8       1.32   .12    1.13    1.52      .330   .135   .040    .880      .54
   9       1.98   .37    1.52    2.67      .334   .124   .040    .790      .44

Table 1 shows that the alpha internal consistency coefficients for Strata
1 through 7 were fairly high and quite similar, ranging from .78 to .86. Alpha
coefficients for Strata 8 and 9 were lower (.54 and .44, respectively). The low
alphas for Strata 8 and 9 were likely due to large amounts of random guessing
for most students, as the average proportion of correct responses of .33 for the
two strata approached the theoretical expectation of .20 for the
five-alternative multiple-choice items.


Table 2

Means and Standard Deviations of Item Difficulties (b) and
Proportion-Correct Scores (p) in Each of the Nine Pairs
of Parallel Substrata (A, B)

               Item Difficulties (b)               Proportion Correct (p)
           Substratum A     Substratum B       Substratum A     Substratum B
Stratum     Mean     SD      Mean     SD        Mean     SD      Mean     SD
   1       -2.320   .296    -2.475   .317       .850    .156     .880    .166
   2       -1.606   .240    -1.497   .153       .753    .202     .832    .198
   3       -1.017   .161     -.993   .126       .713    .224     .711    .239
   4        -.549   .137     -.572   .130       .622    .222     .606    .218
   5        -.123   .084     -.180   .120       .545    .218     .543    .244
   6         .296   .129      .219   .134       .460    .214     .501    .246
   7         .762   .208      .740   .177       .401    .202     .429    .234
   8        1.290   .113     1.354   .126       .341    .163     .317    .157
   9        2.043   .411     1.910   .334       .295    .150     .370    .160


Table 2 provides data on each of the nine pairs of parallel substrata of
12 items each, including the means and standard deviations of item difficulties
(b) and proportion-correct scores (p). Proportion-correct scores for each of
the 151 students on each of the 18 substrata are in Appendix Table C, along
with total proportion correct and the estimated ability level for each student.
As Table 2 indicates, the substrata contained parallel items in the sense of
similar means and standard deviations of difficulties. The smallest difference
in mean difficulty was b=.002 for Stratum 3; the largest difference was b=.155
for Stratum 1, with a mean difference of .07. The proportions correct obtained
by the students on the substrata were also fairly equal in mean and standard
deviation. The smallest difference in mean proportion correct for the paired
substrata was p=.002 for Stratum 3 and Stratum 5; the largest difference in
mean observed proportion correct for the paired substrata was .075 (Stratum 2),
indicating a high degree of similarity in mean proportion correct for the
substrata.

Table 3 shows the estimated alpha coefficients for the 12-item substrata
and the parallel forms correlations obtained by correlating proportion-correct
scores for the 151 students on each of the nine pairs of substrata. The
estimated 12-item alphas were obtained using the Spearman-Brown formula from the
24-item alphas for the strata shown in Table 1; these values were used in
correcting the parallel forms correlations for attenuation. As Table 3 shows, the
uncorrected parallel forms correlations between pairs of substrata ranged from
.63 to .74 for the first seven strata; for the two most difficult strata the
correlations were .42 and .28. These correlations were fairly substantial,
considering the low internal consistency reliabilities for the two most
difficult strata. Using the Spearman-Brown formula to correct the parallel forms
correlations based on two 12-item tests to the 24-item length of the strata,
the average corrected correlation between the pairs of Substrata 1 through 7
was slightly above .80. For the most difficult two strata, the corrected
correlations were .59 and .44.
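
Both uses of the Spearman-Brown formula (stepping a 24-item alpha down to 12
items, and stepping a 12-item parallel forms correlation up to 24 items) can
be sketched with one Python function. The function name is hypothetical; the
input values are taken from Tables 1 and 3 for Stratum 1.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Stratum 1: 24-item alpha of .82 stepped down to a 12-item substratum
alpha_12 = spearman_brown(0.82, 0.5)

# Stratum 1: 12-item parallel forms correlation of .64 stepped up to 24 items
r_24 = spearman_brown(0.64, 2.0)
print(round(alpha_12, 2), round(r_24, 2))
```

These reproduce the Stratum 1 entries of .69 and .78 in Table 3 to within
rounding.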


Table 3

Estimated Alpha Coefficients for 12-Item Substrata,
and Parallel Forms Correlation of Proportion-Correct
Scores (Uncorrected, Corrected by Spearman-Brown
Formula, and Corrected for Attenuation) on Each of
the Nine Pairs of Parallel Substrata

                              Parallel Forms Correlation
           Estimated    ---------------------------------------------
           12-Item                      Spearman-Brown    Attenuation
Stratum    Alpha        Uncorrected     Corrected         Corrected
   1         .69           .64             .78               .93
   2         .74           .74             .85              1.00
   3         .75           .73             .84               .97
   4         .67           .70             .82              1.04
   5         .68           .64             .78               .94
   6         .67           .66             .80               .99
   7         .64           .63             .77               .98
   8         .40           .42             .59              1.00
   9         .28           .28             .44              1.00


To  determine  whether  scores  on  the  paired  substrata  correlated  as  highly 
as  possible,  given  the  reliabilities  of  the  substrata,  the  estimated  12-item 
alphas  for  the  substrata  were  used  along  with  the  uncorrected  parallel  forms 
correlation  to  estimate  the  correlation  between  proportion-correct  scores  on 
the  paired  substrata,  assuming  that  the  substrata  had  been  perfectly  reliable. 
These  attenuation-corrected  correlations  are  shown  as  the  last  column  in  Table 
3.  As  the  data  show,  attenuation-corrected  correlations  were  .97  or  above 
for  seven  of  the  nine  strata;  for  Strata  1  and  5,  these  correlations  were  .93 
and  .94,  respectively.  These  data  indicate  that  the  paired  substrata  scores 
were  as  parallel  as  possible,  given  their  estimated  internal  consistencies. 
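
The correction for attenuation divides the observed correlation by the square
root of the product of the two reliabilities. In the Python sketch below the
function name is hypothetical, and it is assumed that the same estimated
12-item alpha was used as the reliability of both substrata in a pair, which
matches most of the entries in Table 3.

```python
def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Estimated correlation between x and y if both were perfectly reliable."""
    return r_xy / (reliability_x * reliability_y) ** 0.5


# Stratum 1: uncorrected r = .64, estimated 12-item alpha = .69 for both substrata
stratum_1 = correct_for_attenuation(0.64, 0.69, 0.69)

# Stratum 4: uncorrected r = .70, estimated 12-item alpha = .67
stratum_4 = correct_for_attenuation(0.70, 0.67, 0.67)
print(round(stratum_1, 2), round(stratum_4, 2))
```

Because the reliabilities are themselves estimates, a corrected value can
exceed 1.00, as it does for Stratum 4.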

Reliability  of  Observed  PRCs 

Within- and between-persons D² indices. Table 4 shows summary statistics
for the within-persons D² on the parallel substrata and the between-persons D²
using randomly paired individuals. The within-persons D² had a mean of .28, a
standard deviation of .15, and a range of .02 to .86, all relatively small.
These data indicate that for the within-persons D², the average difference in
proportion correct on the paired substrata was about p=.18. By comparison,
the between-persons D² mean was .75, with a standard deviation of .66 and a
range of .07 to 4.09. Thus, the average difference in proportion correct
between randomly paired individuals was about p=.29.
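
The reported average differences appear consistent with a root-mean-square
difference over the nine strata, assuming that the index is the unweighted sum
of squared differences in proportion correct (an assumption about the
computation, since the text does not state it):

$$\sqrt{.28 / 9} \approx .18, \qquad \sqrt{.75 / 9} \approx .29$$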


Table 4

Mean, Standard Deviation, and Range of Within- and Between-
Persons D² Indices, and Results of t Tests Comparing
the Mean Within-Persons D² with Each Between-Persons D² Index

                                          Range
                                       -----------
D² Index            N    Mean    SD    Min     Max      t       p*
Within-Persons     150   .28    .15    .02     .86
Between-Persons
   D(AA)            75   .70    .64    .08    3.92     7.64   <.001
   D(BB)            75   .78    .68    .06    4.32     8.62   <.001
   D(AB)            75   .76    .69    .11    4.50     8.14   <.001
   D(BA)            75   .76    .62    .04    3.60     9.06   <.001

*Probability of error in rejecting null hypothesis of no
difference in group means.


The smaller within-persons D² demonstrates greater split-half profile
similarity within persons than between pairs of randomly selected persons,
irrespective of which split-half test was used for the between-persons comparisons. The
t-test statistics in Table 4 demonstrate this sizable difference between the two
types of profile comparison. Although the t-test assumption of independent
groups was violated in these data, the mean differences in each case were
substantial enough to support the conclusion that the PRCs are reliable.

Figure 4 provides further data on the distribution of the D² indices in
terms of the relative frequency distributions of the within-persons reliability
D² and two of the between-persons D² indices [D(AA) and D(BB)]. As Figure
4 shows, there was little overlap between the two distributions. Virtually all
of the within-persons D² values were below .75; and the distribution was highly
peaked with a modal value very close to zero, indicating that the observed PRCs
from the split substrata were very similar for most of the 150 students. By
contrast, although the mode of between-persons D² indices was similar to that
of the within-persons D², the relative frequency associated with that mode was
considerably less than that of the within-persons distribution, and the
distributions of the between-persons data were considerably less peaked.

Figure 4

Relative Frequency Distributions of Within- and Between-Persons
D² Indices for Observed Split-Half Person Response Curves (PRCs)

Chi-square tests of independence. Results of the chi-square test of
independence, based on a 2 x 9 contingency table with number-correct scores on
each of the nine pairs of parallel substrata for each student, are shown in
Figure 5. The minimum value of chi-square was .14 and the maximum was 12.94;
mean of the distribution was 3.67, with a standard deviation of 2.34. A
chi-square value of 15.51 is statistically significant at the p=.05 level with
8 degrees of freedom (from the 2 x 9 contingency table). Since all of the
individual chi-square values were less than 15.51, the data show that the two
split-pool observed PRCs for all students were not significantly different from
each other, further supporting the D2 data which indicated that the observed
PRCs obtained from these data were reliable.
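
As an illustration of the test just described, the sketch below computes a
Pearson chi-square of independence for one student's 2 x 9 table of
number-correct scores. The scores are hypothetical, the function name is an
assumption, and the critical value of 15.51 is simply the tabled value quoted
in the text.

```python
def chi_square_independence(table):
    """Pearson chi-square and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip strata with no correct answers in either half
                chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


CRITICAL_05_DF8 = 15.51  # tabled value quoted in the text

# Hypothetical number-correct scores on nine pairs of parallel substrata.
substrata_a = [5, 5, 4, 4, 3, 3, 2, 1, 1]
substrata_b = [5, 4, 4, 3, 3, 2, 2, 1, 0]

chi2, df = chi_square_independence([substrata_a, substrata_b])
print(round(chi2, 2), df, chi2 > CRITICAL_05_DF8)
```

A value below the critical value, as for every student in this study, means the
two split-pool profiles cannot be distinguished statistically.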


Figure  5 

Frequency  Polygon  of  Rounded  Intra-Individual 
Chi-Square  Test  of  Independence  Values  Between  Person 
Response  Curves  (PRCs)  for  the  Nine  Pairs  of  Parallel  Substrata 


PRCs and Person-Fit

The frequency distribution of individual chi-square values reflecting the
fit of observed PRCs to the PRCs expected from the three-parameter ICC model
is shown in Figure 6. The lowest chi-square value obtained was 1.88 and the
highest was 23.17. Mean chi-square was 8.76, with a standard deviation of 4.14;
modal value was about 6.0.

Since ability estimates used in calculating the theoretically expected
proportion-correct scores were taken from the data being analyzed, an extra
degree of freedom was subtracted to determine the significance of the
chi-square values. Thus, with 7 degrees of freedom, a chi-square value of 14.07
is significant at the .05 level. The group mean chi-square was well below this
value, which would suggest that the three-parameter logistic ICC model served
as a fairly good predictor of test response behavior for the majority of this
group of students. Of 151 students, 8 would be expected to have significant
chi-square values by chance alone at the .05 level; in this group, 15 students
had chi-square values greater than 14.07.
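
The person-fit computation can be sketched as follows. The exact form of the
chi-square statistic is not restated in this section, so the sketch assumes a
Pearson-type statistic that pools correct and incorrect responses within each
stratum; the scaling constant D = 1.7, the item parameters, the ability
estimate, and the observed profile are all hypothetical.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed)


def p_correct_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def expected_prc(theta, strata):
    """Expected proportion correct per stratum: mean 3PL probability of its items."""
    return [
        sum(p_correct_3pl(theta, a, b, c) for a, b, c in items) / len(items)
        for items in strata
    ]


def person_fit_chi_square(observed, expected, n_items):
    """Assumed Pearson form: sum over strata of n * (o - e)^2 / (e * (1 - e))."""
    return sum(
        n * (o - e) ** 2 / (e * (1.0 - e))
        for o, e, n in zip(observed, expected, n_items)
    )


# Hypothetical pool: nine strata of four items each, ordered by difficulty b.
difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
strata = [[(1.0, b, 0.2)] * 4 for b in difficulties]
n_items = [len(items) for items in strata]

theta_hat = 0.0  # hypothetical ability estimate
expected = expected_prc(theta_hat, strata)
observed = [1.0, 1.0, 0.75, 0.75, 0.5, 0.5, 0.25, 0.25, 0.25]

CRITICAL_05_DF7 = 14.07  # tabled value quoted in the text
chi2 = person_fit_chi_square(observed, expected, n_items)
print(round(chi2, 2), chi2 > CRITICAL_05_DF7)
```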

Figure  6 

Frequency  Distribution  of  Intra-Individual  Chi-Square 
Values  for  Goodness  of  Fit  Between  Observed  and 
Expected  Person  Response  Curves  (PRCs) 


To identify persons reliably deviating from the model, the chi-square
person-fit statistics were recomputed for each student separately on the two
sets of substrata. The joint distribution of chi-square values for the 151
students is shown in Figure 7, with the .05 significance level indicated by
the dashed horizontal and vertical lines. Persons in the upper right-hand
quadrant were identified as those deviating significantly from the expected
values, with p=.0025 (the product of the two .05 levels, assuming the two tests
are independent). As Figure 7 shows, six students had significant chi-square
values for both sets of substrata and were thus placed in the upper right-hand
quadrant. Of these six, four were also significantly non-fitting on the overall
chi-square goodness-of-fit test. These four are indicated in Figure 7 by their
subject numbers, and their PRCs (both observed and expected) are in Figure 9.
Persons 83, 111, 138, and 117 might be hypothesized to have reliably
non-fitting PRCs. Of the 15 students whose overall chi-square values were
statistically significant, those not included in the upper right-hand quadrant
may be hypothesized to be non-fitting only by chance.
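
The chance expectations used in this argument can be checked with a short
calculation. The sketch assumes that students are independent of one another
and that, under the null hypothesis, the odd- and even-substrata tests are
independent within a student; neither assumption is verified here, and the
function name is illustrative only.

```python
from math import comb


def binomial_upper_tail(n, p, k_min):
    """P(X >= k_min) for X ~ Binomial(n, p), via the complement of the lower tail."""
    lower = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k_min))
    return 1.0 - lower


n_students = 151
alpha = 0.05
joint_alpha = alpha * alpha  # both split tests significant: .0025

expected_single = n_students * alpha       # about 7.55, reported as 8
expected_joint = n_students * joint_alpha  # well under one student

# Chance probability of at least 15 single-test and at least 6 joint rejections.
p_at_least_15 = binomial_upper_tail(n_students, alpha, 15)
p_at_least_6_joint = binomial_upper_tail(n_students, joint_alpha, 6)
print(expected_single, expected_joint, p_at_least_15, p_at_least_6_joint)
```

Under these assumptions, fewer than one student would be expected in the upper
right-hand quadrant by chance, which is the rationale for treating students in
that quadrant as reliably non-fitting.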


Figure  7 

Joint  Distribution  of  Intra-Individual  Chi-Square  Values  for 
Goodness  of  Fit  for  Odd-  and  Even-Numbered  Substrata 

Figure 8 shows observed and expected PRCs for students with low overall
chi-square person-fit values. Person 128 (Figure 8a) obtained the lowest
chi-square value among the 151 students tested. As Figure 8a shows, the observed
PRC for Person 128 (solid line) was quite close to the expected PRC (dashed
line) for each of the nine strata. Figures 8b through 8d show expected and
observed PRCs for three other students for whom model-fit was quite good, as
indicated by the low chi-square values, although as expected, some minor
deviations from model-fit appeared (e.g., Figure 8d) as chi-square values
increased.

Figure 9 shows PRC person-fit results for four of the persons identified
in Figure 7 as not reliably fitting the ICC model; these data are based on
their total PRCs. The ways in which these four students' response curves
deviated from their expected curves differed widely. Person 111 (Figure 9a)
seems to have been careless with easier items, as indicated by a proportion
correct of .75 on items in Stratum 1, and then to have been fortunate in
guessing on some of the more difficult ones (p=.50 on Stratum 7). On the other
hand, this may be the type of profile to be expected from a person with an
unusual educational history, such as an international student with a
specialized knowledge of English. Person 117 (Figure 9b) and Person 138
(Figure 9d) seemed to have done much better on difficult items than was
predicted by the model; these students might be sophisticated at guessing or
high in "testwiseness." Person 83 (Figure 9c) seems to have exhibited
carelessness on the easier items (Stratum 1) but more effort (with perhaps some
good guesses) on the more difficult items in Strata 6 through 8.

Although these figures demonstrate lack of fit of individuals to the
model-based predictions, they do not by themselves point to clear
interpretations. However, they do illustrate some of the different ways in
which significant deviations in test data can occur. This demonstrates the need
for methods of assessing and interpreting the many ways in which non-fitting
PRCs may occur.

PRCs  and  Ability  Level 

Additional data supporting the overall fit of persons to the
three-parameter ICC model are shown in Table 5. Table 5 summarizes the
distributions of


Table 5

Mean, Standard Deviation, and Range of Within-Persons D2
and θ̂-Matched Between-Persons D2, and Results of t Tests
Comparing the Mean Within-Persons D2 with Each
Between-Persons D2 Index

                                        Range
D2 Index            N    Mean    SD    Min    Max      t      p*
Within-Persons     150    .28    .15   .02    .86
Between-Persons
  D(AA)             75    .25    .11   .03    .48    1.50    <.20
  D(BB)             75    .26    .13   .05    .74    1.00    <.50
  D(AB)             75    .29    .14   .05    .79    0.50    <.80
  D(BA)             75    .28    .12   .07    .64    0.00    1.00

*Probability of error in rejecting null hypothesis of no difference
in group means (two-tailed test).


Figure 8

Observed and Expected Person Response Curves (PRCs) for Four
Persons Whose Responses Reliably Fit the Three-Parameter ICC Model

[Four panels plotting proportion correct (p) against stratum (difficulty
level); solid line = observed PRC, dashed line = expected PRC.
(a) Person 128 (chi-square = 1.88); (b) Person 135 (chi-square = 1.92);
(c) Person 20 (chi-square = 2.96); (d) Person 133 (chi-square = 3.35).]


Figure 9

Observed and Expected Person Response Curves (PRCs) for Four Persons
Whose Responses Did Not Reliably Fit the Three-Parameter ICC Model

[Four panels plotting proportion correct (p) against stratum (difficulty
level); solid line = observed PRC, dashed line = expected PRC.
(a) Person 111 (chi-square = 23.17); (b) Person 117 (chi-square = 22.75);
(c) Person 83; (d) Person 138 (chi-square = 14.14).]


between-persons D2 data on parallel substrata when students were matched as
closely as possible for θ̂ values before the substrata D2 indices were
calculated. As Table 5 shows, none of the mean between-persons D2 indices was
significantly different from the within-persons D2; in three of the four cases
the mean between-persons D2 was slightly lower than the mean within-persons D2.
In addition, the standard deviations and ranges of the two kinds of D2 indices
were very similar. Thus, the data in Table 5 show that observed PRCs for this
group of students were highly dependent upon their ability levels, further
supporting the fit of these individuals to the three-parameter ICC model.
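
The t tests in Table 5 can be approximated from the tabled summary statistics.
The report does not state which form of the t test was used, so the sketch
below assumes an independent-groups test with pooled variance; because the
tabled means and standard deviations are rounded to two decimals, the results
can only approximate the tabled t values.

```python
import math


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-groups t statistic with pooled variance, from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    standard_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / standard_error, df


# Summary statistics (mean, SD, N) as printed in Table 5.
within = (0.28, 0.15, 150)
between = {
    "D(AA)": (0.25, 0.11, 75),
    "D(BB)": (0.26, 0.13, 75),
    "D(AB)": (0.29, 0.14, 75),
    "D(BA)": (0.28, 0.12, 75),
}

results = {label: pooled_t(*within, *stats) for label, stats in between.items()}
for label, (t, df) in results.items():
    print(label, round(t, 2), df)
```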

Correlates  of  Observed  PRCs 

Table 6 shows intercorrelations of the within-persons D2 PRC reliability
index; the PRC person-fit chi-square value for each person; ability estimates
(θ̂); and the Perceived Test Difficulty, Test-Taking Anxiety, and Test-Taking
Motivation scale scores. The D2 reliability indices correlated significantly
(r=-.24) with ability, indicating a tendency for lower ability students to
have more unreliable PRCs. D2 also correlated significantly and positively with
both Perceived Test Difficulty and Test-Taking Anxiety scale scores; the
correlation with Perceived Test Difficulty scores probably reflected the high
negative correlation (r=-.70) between ability level and perceived difficulty
of the test items. The correlation of r=.18 with Test-Taking Anxiety suggests
a tendency for students with higher test-taking anxiety to have less reliable
PRCs. None of the correlations of the chi-square person-fit index was
statistically significant. Further analysis of the chi-square person-fit
indices by analysis of variance indicated no nonlinear relationships
between the chi-square index and the Perceived Test Difficulty, Test-Taking
Anxiety, and Test-Taking Motivation scale scores.


Table 6

Intercorrelations of Ability Estimates (θ̂), Psychological
Reactions Scales, and PRC Within-Persons D2 and
Person-Fit Chi-Square

                                      Perceived   Test-     Test-       Within-
                                      Test        Taking    Taking      Persons
                            Ability   Difficulty  Anxiety   Motivation  D2
Ability
Perceived Test Difficulty   -.70**
Test-Taking Anxiety         -.16*      .28**
Test-Taking Motivation       .37**    -.35**       .11**
Within-Persons D2           -.24**     .18**       .18**     .03
Person-Fit Chi-Square       -.06      -.05         .07       .04        -.04

*Significant at p<.05.
**Significant at p<.01.


These results are consistent with the previously reported findings that
the three-parameter logistic model seemed to predict quite well the test
performance of the majority of the students in this sample. Since only a few of
the students deviated significantly and reliably from the predictions from the
model, it would be impossible to find strong relationships between the
goodness-of-fit results and other variables. Furthermore, as was illustrated in
Figure 9, there are many possible ways of deviating from the model and,
consequently, there may be many correlates of such deviations.

Conclusions and Directions for Future Research

The feasibility of the person response curve (PRC) approach to
investigating the fit of persons to the three-parameter ICC model was explored
in this study. To operationalize the PRC it was necessary to subdivide ability
test items into separate strata of varying difficulty levels. For the vocabulary
test used in this study, strata possessed sufficient internal consistency and
parallel forms reliability to justify their use, although the more difficult
strata were much less reliable than the easier strata.

Conclusions

The PRCs proved to be highly reliable. The D2 analyses indicated not only
that intra-individual profiles were more similar than profiles between randomly
selected persons but also that profiles between people of similar ability level
were very similar. As additional evidence of profile reliability, chi-square
tests for independence between profiles of parallel forms for each individual
were nonsignificant for all 151 students. The high correlation of r=.82
(p<.001) between the intra-individual parallel forms chi-squares and the D2
suggests that the chi-square test may be sufficient in future studies, since it
also provides a more ready means of assessing statistical significance.

The  results  of  the  D2  statistics  between  individuals  matched  on  ability 
level  were  interesting,  since  they  illustrated  close  profile  similarity  between 
different  persons  of  similar  ability  level.  This  suggests  that  for  the  majority 
of  this  sample,  PRCs  were  predictable  as  a  function  of  ability  level.  A  more 
complete  test  of  this  hypothesis  was  conducted  with  a  chi-square  goodness-of- 
fit  test  between  observed  proportion-correct  scores  on  each  of  nine  strata  and 
expected  proportion-correct  scores  predicted  by  the  three-parameter  logistic 
model.  The  nonsignificant  group  mean  suggests  that  the  model  was  a  reasonable 
way  of  describing  students’  test  response  behavior. 

At the .05 level, eight students were expected to have significant
chi-square goodness-of-fit values for observed and expected PRCs. Fifteen
students had significant chi-square values, leaving somewhat in question whether
these students deviated from the model because of chance or interaction with
another dimension. One method of investigating this question was to calculate
separately the goodness of fit of each student's observed and expected PRCs on
the odd-numbered substrata and on the even-numbered substrata. Of the 15
students with significant chi-square values on the overall nine-strata
goodness-of-fit test, four had significant chi-square values on both substrata
goodness-of-fit tests. These four students were identified as reliably
deviating from the ICC model predictions. The nature of this lack of fit,
however, would best be investigated in a future study with an experimental
design that included interactions with additional dimensions other than the
ability being measured.

Having demonstrated the goodness of fit of observed PRCs with
model-predicted PRCs, and with no firm evidence to suggest that significant
results for a majority of the students were due to anything other than chance,
the nonsignificant results for the relationship of the goodness-of-fit
chi-square variable with nontest variables seem to follow. Scores on the
psychological reactions scales correlated with each other and with ability
estimates in expected ways but did not correlate significantly with the overall
chi-square variable. These results substantiated the fit of the model to
observed student test-response behavior. The psychological reactions scales
could be used in a future study of non-fit in which these psychological states
could be experimentally induced.

The results of this study demonstrate that the PRC can be useful in
studying the fit of individuals to ICC models by testing the fit of the observed
PRC to the theoretically expected PRC. Although the three-parameter ICC model
was used here, the method can be used with the two-parameter or one-parameter
logistic (Rasch) model or with any of the normal ogive ICC models. The data
also demonstrated that the three-parameter ICC model adequately accounted for
the test response behavior of the vast majority of the students studied. More
research is, of course, necessary to further explore the use of the PRC in
examining model-fit in test behavior.

Directions  for  Future  Research 

Guessing  and  "testwiseness"  are  variables  which  are  unrelated  to  abilities 
but  may  affect  ability  test  scores.  To  determine  whether  these  variables  can 
be  detected  by  PRCs  or  PRC-fit  to  theoretical  predictions,  a  useful  experiment 
would  be  to  administer  a  multiple-choice  ability  test  along  with  testwiseness 
and  guessing  scales  to  groups  of  students.  One  subgroup  in  the  experimental 
design  should  be  an  experimental  group  trained  in  testwiseness  and/or  in 
guessing  skills.  The  effects  of  testwiseness  or  guessing  would  be  studied  by 
analysis of the chi-square goodness-of-fit statistics comparing the expected
and  observed  PRCs  for  the  experimental  and  control  groups.  Special  attention 
should  be  given  to  chi-square  values  on  the  most  difficult  items  in  the  ability 
test  rather  than  overall  chi-squares,  since  it  is  on  these  items  that  the 
experimental  effect  is  likely  to  be  observed. 

Cultural bias is another dimension which may differentially affect ability
test performance (e.g., Church, Pine, & Weiss, 1978; Martin, Pine, & Weiss,
1978; Pine & Weiss, 1978). One approach to testing for the existence of such
bias by use of PRCs would be to compare the goodness of fit of observed and
expected PRCs for a control group of white middle-class testees and a group of
testees who would be hypothesized to have uneven educational development by
white middle-class American standards. This latter group might involve
international students with a specialized knowledge of the English language or
some American minority group persons. It would be expected that the PRCs would
show greater deviation from the model predictions for the latter group,
particularly in terms of deviations from the unidimensionality required by the
ICC model.

Carelessness and nervousness are two other dimensions which may contribute
to unexpected performance on ability tests and which may be detected by PRC
analysis. To study the effect of these dimensions on person-fit, an ability
test could be administered to three groups of randomly selected individuals
from the same population. A control group would be given minimal information
about the test. Treatment Group 1 would be told that the test results did not
matter and that the experimenter just needed to fill his/her quota of subjects;
this would be considered the low-motivation, possibly careless group. Treatment
Group 2 would be told that the test is an important determiner of whether or
not they would be able to complete college or to succeed in some occupation;
this would be considered the high-anxiety group. The experimentally induced
states should be verified with improved versions of the psychological reactions
scales for motivation and anxiety used in this report. Chi-square values
comparing observed versus expected PRCs would be compared across groups, with
special attention to the PRC-fit data on the easier strata for the
low-motivation group and on the more difficult strata for the high-anxiety
group. This would give information on possible psychological correlates of fit
on a stratum-by-stratum basis. Data of this type might be used, for example, to
investigate the operation of the Yerkes-Dodson Law (Taylor & Spence, 1958;
Yerkes & Dodson, 1908) in ability test data; PRC-fit data would support this
hypothesis if high-anxiety testees perform better than expected on easy test
items and more poorly than expected on the more difficult test items.


Further investigation of the measurement properties of the chi-square
goodness-of-fit statistics comparing observed and expected PRCs for assessing
non-fit of persons is also of importance. Monte Carlo simulations should be run
in order to determine the null distribution of the chi-square values. These
should be repeated at a number of theta levels to determine whether
goodness-of-fit distributions differ as a function of ability level.
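
A minimal sketch of such a simulation is given below. It assumes the same
Pearson-type chi-square form and D = 1.7 scaling as a working convention,
generates model-consistent responses at a known theta, and reads off an
empirical .05 critical value at several theta levels. The item pool, the number
of replications, and all function names are hypothetical. Because the true
theta is used rather than an estimate, these values are not directly comparable
to the 7-degrees-of-freedom criterion used in this report.

```python
import math
import random

D = 1.7  # conventional logistic scaling constant (assumed)


def p_correct_3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_chi_square(theta, strata, rng):
    """Simulate one model-fitting testee and return the stratum chi-square."""
    chi2 = 0.0
    for items in strata:
        probs = [p_correct_3pl(theta, a, b, c) for a, b, c in items]
        n = len(items)
        expected = sum(probs) / n
        observed = sum(rng.random() < p for p in probs) / n
        chi2 += n * (observed - expected) ** 2 / (expected * (1.0 - expected))
    return chi2


def empirical_critical_value(theta, strata, n_reps=2000, alpha=0.05, seed=0):
    """Empirical (1 - alpha) quantile of the simulated null distribution."""
    rng = random.Random(seed)
    values = sorted(simulate_chi_square(theta, strata, rng) for _ in range(n_reps))
    index = min(n_reps - 1, math.ceil((1.0 - alpha) * n_reps) - 1)
    return values[index]


# Hypothetical pool: nine strata of eight items each, ordered by difficulty b.
difficulties = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
strata = [[(1.0, b, 0.2)] * 8 for b in difficulties]

for theta in (-1.0, 0.0, 1.0):
    print(theta, round(empirical_critical_value(theta, strata), 2))
```

Comparing the empirical critical values across theta levels indicates whether a
single tabled chi-square criterion is reasonable for all ability levels.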


Finally, since the research literature on methods for assessing non-fitting
profiles has begun to branch in several different directions, it would be
informative and useful to compare the efficacy of several different methods
using the same data base. The one-, two-, and three-parameter ICC models could
each be used in computing ability estimates so that non-fit measures based on
these different models could be used. This would best be done in simulation,
with non-fitting data experimentally induced so that the different methods of
evaluating model-fit could be compared on their degree of "hits" and "misses."
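
One way to tabulate "hits" and "misses" in such a simulation is sketched below.
The function name, the flag vectors, and the choice to report a false-alarm
rate alongside the hit rate are assumptions made for illustration; no results
from this study are implied.

```python
def detection_rates(flagged, truly_nonfitting):
    """Hit, miss, and false-alarm rates for one person-fit method.

    flagged: booleans, True where the method declared a simulee non-fitting.
    truly_nonfitting: booleans, True where non-fit was actually induced.
    """
    pairs = list(zip(flagged, truly_nonfitting))
    hits = sum(f and t for f, t in pairs)
    misses = sum((not f) and t for f, t in pairs)
    false_alarms = sum(f and (not t) for f, t in pairs)
    n_nonfitting = hits + misses
    n_fitting = len(pairs) - n_nonfitting
    return {
        "hit_rate": hits / n_nonfitting if n_nonfitting else float("nan"),
        "miss_rate": misses / n_nonfitting if n_nonfitting else float("nan"),
        "false_alarm_rate": false_alarms / n_fitting if n_fitting else float("nan"),
    }


# Hypothetical simulation outcome for five simulees and two methods.
truth = [True, True, False, False, False]
method_a = [True, False, False, True, False]
method_b = [True, True, False, False, False]

rates_a = detection_rates(method_a, truth)
rates_b = detection_rates(method_b, truth)
print(rates_a)
print(rates_b)
```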


These are only a few of many research possibilities in investigating the
properties and the diagnostic utility of PRCs. A closer look at these
properties of the PRC test performance profiles and their use in determining
person-fit may provide important information on selected individuals and
improve the validity of ability tests for individual prediction and diagnosis.

Table B

Test Reaction Items Used for the Perceived Test Difficulty and Test-Taking
Anxiety and Motivation Scales, and Scoring Weights for Each Response

Table C

Ability Estimate (θ̂), Total Proportion Correct (T), and Proportion Correct for